Abstract

We consider regularized support vector machines (SVMs) and show that they are precisely
equivalent to a new robust optimization formulation. We show that this equivalence of
robust optimization and regularization has implications for both algorithms and analysis.
In terms of algorithms, the equivalence suggests more general SVM-like algorithms for
classification that explicitly build in protection to noise, and at the same time control
overfitting. On the analysis front, the equivalence of robustness and regularization provides
a robust optimization interpretation for the success of regularized SVMs. We use this
new robustness interpretation of SVMs to give a new proof of consistency of (kernelized)
SVMs, thus establishing robustness as the reason regularized SVMs generalize well.

Keywords: Robustness, Regularization, Generalization, Kernel, Support Vector Machine 
1. Introduction 



Support Vector Machines (SVMs for short) originated in Boser et al. (1992) and can be traced back to as early as Vapnik and Lerner (1963) and Vapnik and Chervonenkis (1974). They continue to be one of the most successful algorithms for classification. SVMs address the classification problem by finding the hyperplane in the feature space that achieves maximum sample margin when the training samples are separable, which leads to minimizing the norm of the classifier. When the samples are not separable, a penalty term that approximates the total training error is considered (Bennett and Mangasarian, 1992; Cortes and Vapnik, 1995). It is well known that minimizing the training error itself can lead to poor classification performance for new unlabeled data; that is, such an approach may have poor generalization error because of, essentially, overfitting (Vapnik and Chervonenkis, 1991). A variety of modifications have been proposed to combat this problem, one of the most popular methods being that of minimizing a combination of the training error and a regularization term. The latter is typically chosen as a norm of the classifier. The resulting regularized classifier performs better on new data. This phenomenon is often interpreted from a statistical learning theory view: the regularization term restricts the complexity of the classifier, hence the deviation of the testing error and the training error is controlled (see Smola et al., 1998; Evgeniou et al., 2000; Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002; Bartlett et al., 2005, and references therein).

In this paper we consider a different setup, assuming that the training data are generated by the true underlying distribution, but some non-i.i.d. (potentially adversarial) disturbance is then added to the samples we observe. We follow a robust optimization approach (see El Ghaoui and Lebret, 1997; Ben-Tal and Nemirovski, 1999; Bertsimas and Sim, 2004, and references therein), i.e., minimizing the worst possible empirical error under such disturbances. The use of robust optimization in classification is not new (e.g., Shivaswamy et al., 2006; Bhattacharyya et al., 2004b; Lanckriet et al., 2002). Robust classification models studied in the past have considered only box-type uncertainty sets, which allow the possibility that the data have all been skewed in some non-neutral manner by a correlated disturbance. This has made it difficult to obtain non-conservative generalization bounds. Moreover, there has not been an explicit connection to the regularized classifier, although at a high level it is known that regularization and robust optimization are related (e.g., El Ghaoui and Lebret, 1997; Anthony and Bartlett, 1999). The main contribution in this paper is solving the robust classification problem for a class of non-box-type uncertainty sets, and providing a linkage between robust classification and the standard regularization scheme of SVMs. In particular, our contributions include the following:

• We solve the robust SVM formulation for a class of non-box-type uncertainty sets. 
This permits finer control of the adversarial disturbance, restricting it to satisfy ag- 
gregate constraints across data points, therefore reducing the possibility of highly 
correlated disturbance. 



• We show that the standard regularized SVM classifier is a special case of our robust 
classification, thus explicitly relating robustness and regularization. This provides 
an alternative explanation to the success of regularization, and also suggests new 
physically motivated ways to construct regularization terms. 

• We relate our robust formulation to several probabilistic formulations. We consider 
a chance-constrained classifier (i.e., a classifier with probabilistic constraints on mis- 
classification) and show that our robust formulation can approximate it far less con- 
servatively than previous robust formulations could possibly do. We also consider 
a Bayesian setup, and show that this can be used to provide a principled means of 
selecting the regularization coefficient without cross-validation. 

• We show that the robustness perspective, stemming from a non-i.i.d. analysis, can 
be useful in the standard learning (i.i.d.) setup, by using it to prove consistency 
for standard SVM classification, without using VC-dimension or stability arguments.
This result implies that generalization ability is a direct result of robustness to local 
disturbances; it therefore suggests a new justification for good performance, and conse- 
quently allows us to construct learning algorithms that generalize well by robustifying 
non-consistent algorithms. 

Robustness and Regularization 

We comment here on the explicit equivalence of robustness and regularization. We briefly explain how this observation is different from previous work and why it is interesting. Certain equivalence relationships between robustness and regularization have been established for problems other than classification (El Ghaoui and Lebret, 1997; Ben-Tal and Nemirovski, 1999; Bishop, 1995), but their results do not directly apply to the classification problem. Indeed, research on classifier regularization mainly discusses its effect on bounding the complexity of the function class (e.g., Smola et al., 1998; Evgeniou et al., 2000; Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002; Bartlett et al., 2005). Meanwhile, research on robust classification has not attempted to relate robustness and regularization (e.g., Lanckriet et al., 2002; Bhattacharyya et al., 2004a,b; Shivaswamy et al., 2006; Trafalis and Gilbert, 2007; Globerson and Roweis, 2006), in part due to the robustness formulations used in those papers. In fact, they all consider robustified versions of regularized classifications.¹ Bhattacharyya (2004) considers a robust formulation for box-type uncertainty, and relates this robust formulation with regularized SVM. However, this formulation involves a non-standard loss function that does not bound the 0-1 loss, and hence its physical interpretation is not clear.

The connection of robustness and regularization in the SVM context is important for the 
following reasons. First, it gives an alternative and potentially powerful explanation of the 
generalization ability of the regularization term. In the classical machine learning literature, 
the regularization term bounds the complexity of the class of classifiers. The robust view 
of regularization regards the testing samples as a perturbed copy of the training samples. 
We show that when the total perturbation is given or bounded, the regularization term 
bounds the gap between the classification errors of the SVM on these two sets of samples. 
In contrast to the standard PAC approach, this bound depends neither on how rich the 
class of candidate classifiers is, nor on an assumption that all samples are picked in an 
i.i.d. manner. In addition, this suggests novel approaches to designing good classification 
algorithms, in particular, designing the regularization term. In the PAC structural-risk 
minimization approach, regularization is chosen to minimize a bound on the generalization 
error based on the training error and a complexity term. This complexity term typically 
leads to overly emphasizing the regularizer, and indeed this approach is known to often
be too pessimistic (Kearns et al., 1997) for problems with more structure. The robust
approach offers another avenue. Since both noise and robustness are physical processes, a
close investigation of the application and noise characteristics at hand can provide insights
into how to properly robustify, and therefore regularize the classifier. For example, it is
known that normalizing the samples so that the variance among all features is roughly the 
same (a process commonly used to eliminate the scaling freedom of individual features) often 
leads to good generalization performance. From the robustness perspective, this simply says 
that the noise is anisotropic (ellipsoidal) rather than spherical, and hence an appropriate 
robustification must be designed to fit this anisotropy. 

We also show that using the robust optimization viewpoint, we obtain some probabilistic 
results outside the PAC setup. In Section 3 we bound the probability that a noisy training
sample is correctly labeled. Such a bound considers the behavior of corrupted samples and is 
hence different from the known PAC bounds. This is helpful when the training samples and 
the testing samples are drawn from different distributions, or some adversary manipulates 
the samples to prevent them from being correctly labeled (e.g., spam senders change their 
patterns from time to time to avoid being labeled and filtered). Finally, this connection of
robustification and regularization also provides us with new proof techniques (see
Section 5).

1. Lanckriet et al. (2002) is perhaps the only exception, where a regularization term is added to the covariance estimation rather than to the objective function.

We need to point out that there are several different definitions of robustness in the literature. In this paper, as well as the aforementioned robust classification papers, robustness is mainly understood from a Robust Optimization perspective, where a min-max optimization is performed over all possible disturbances. An alternative interpretation of robustness stems from the rich literature on Robust Statistics (e.g., Huber, 1981; Hampel et al., 1986; Rousseeuw and Leroy, 1987; Maronna et al., 2006), which studies how an estimator or algorithm behaves under a small perturbation of the statistics model. For example, the Influence Function approach, proposed in Hampel (1974) and Hampel et al. (1986), measures the impact of an infinitesimal amount of contamination of the original distribution on the quantity of interest. Based on this notion of robustness, Christmann and Steinwart (2004) showed that many kernel classification algorithms, including SVM, are robust in the sense of having a finite Influence Function. A similar result for regression algorithms is shown in Christmann and Steinwart (2007) for smooth loss functions, and in Christmann and Van Messem (2008) for non-smooth loss functions where a relaxed version of the Influence Function is applied. In the machine learning literature, another widely used notion closely related to robustness is stability, where an algorithm is required to be robust (in the sense that the output function does not change significantly) under a specific perturbation: deleting one sample from the training set. It is now well known that a stable algorithm such as SVM has desirable generalization properties, and is statistically consistent under mild technical conditions; see for example Bousquet and Elisseeff (2002); Kutin and Niyogi (2002); Poggio et al. (2004); Mukherjee et al. (2006) for details. One main difference between Robust Optimization and other robustness notions is that the former is constructive rather than analytical. That is, in contrast to robust statistics or the stability approach, which measure the robustness of a given algorithm, Robust Optimization can robustify an algorithm: it converts a given algorithm to a robust one. For example, as we show in this paper, the RO version of a naive empirical-error minimization is the well-known SVM. As a constructive process, the RO approach also leads to additional flexibility in algorithm design, especially when the nature of the perturbation is known or can be well estimated.

Structure of the Paper: This paper is organized as follows. In Section 2 we investigate
the correlated disturbance case, and show the equivalence between the robust classification
and the regularization process. We develop the connections to probabilistic formulations
in Section 3, and prove a consistency result based on robustness analysis in Section 5.
The kernelized version is investigated in Section 4. Some concluding remarks are given in
Section 6.

Notation: Capital letters are used to denote matrices, and boldface letters are used to denote column vectors. For a given norm $\|\cdot\|$, we use $\|\cdot\|^*$ to denote its dual norm, i.e., $\|z\|^* \triangleq \sup\{z^\top x \mid \|x\| \le 1\}$. For a vector $x$ and a positive semi-definite matrix $C$ of the same dimension, $\|x\|_C$ denotes $\sqrt{x^\top C x}$. We use $\delta$ to denote disturbance affecting the samples. We use superscript $r$ to denote the true value for an uncertain variable, so that $\delta_i^r$ is the true (but unknown) noise of the $i$-th sample. The set of non-negative scalars is denoted by $\mathbb{R}^+$. The set of integers from 1 to $n$ is denoted by $[1:n]$.
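As a small numerical illustration of this notation, the following Python sketch evaluates one dual norm and one $\|\cdot\|_C$ norm. It is only an example: the function names, the choice of the $\ell_1$ norm (whose dual is the $\ell_\infty$ norm), and the test vectors are assumptions made here and do not come from the paper.

```python
import numpy as np

def dual_of_l1_norm(z):
    """sup{z^T x : ||x||_1 <= 1}, attained at a signed coordinate vector."""
    k = int(np.argmax(np.abs(z)))
    x = np.zeros_like(z)
    x[k] = np.sign(z[k])          # feasible point with ||x||_1 <= 1
    return float(z @ x)

def c_norm(x, C):
    """||x||_C = sqrt(x^T C x) for a positive semi-definite matrix C."""
    return float(np.sqrt(x @ C @ x))

# Hypothetical test data.
z = np.array([1.0, -3.0, 2.0])
C = np.diag([4.0, 1.0, 1.0])

dual_value = dual_of_l1_norm(z)                      # the l-infinity norm of z
weighted = c_norm(np.array([1.0, 0.0, 0.0]), C)      # stretches the first axis
```

The first value matches the largest absolute entry of `z`, which is the usual statement that the $\ell_\infty$ norm is dual to the $\ell_1$ norm.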
2. Robust Classification and Regularization

We consider the standard binary classification problem, where we are given a finite number of training samples $\{x_i, y_i\}_{i=1}^m \subseteq \mathbb{R}^n \times \{-1, +1\}$, and must find a linear classifier, specified by the function $h^{w,b}(x) = \mathrm{sgn}(\langle w, x\rangle + b)$. For the standard regularized classifier, the parameters $(w, b)$ are obtained by solving the following convex optimization problem:

$$
\begin{aligned}
\min_{w,b,\xi}:\quad & r(w,b) + \sum_{i=1}^m \xi_i \\
\text{s.t.}:\quad & \xi_i \ge 1 - y_i(\langle w, x_i\rangle + b), \\
& \xi_i \ge 0,
\end{aligned}
$$

where $r(w, b)$ is a regularization term. This is equivalent to

$$
\min_{w,b}\ \Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] \Big\}.
$$
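The unconstrained form above is easy to evaluate directly. The following Python sketch computes the per-sample hinge losses and the regularized objective for a fixed $(w, b)$; the helper names, the toy data, and the particular regularizer are hypothetical choices made for illustration only.

```python
import numpy as np

def hinge_losses(w, b, X, y):
    """Per-sample hinge loss max[1 - y_i(<w, x_i> + b), 0]."""
    margins = y * (X @ w + b)
    return np.maximum(1.0 - margins, 0.0)

def regularized_objective(w, b, X, y, r):
    """r(w, b) plus the total hinge loss (the unconstrained form)."""
    return r(w, b) + hinge_losses(w, b, X, y).sum()

# Toy data (hypothetical): three points in R^2 with labels in {-1, +1}.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

def norm_regularizer(w, b):
    """One possible choice of r: half the Euclidean norm of w."""
    return 0.5 * np.linalg.norm(w)

losses = hinge_losses(w, b, X, y)
objective = regularized_objective(w, b, X, y, norm_regularizer)
```

Each slack variable $\xi_i$ in the constrained form takes the value of the corresponding entry of `losses` at the optimum, which is why the two formulations agree.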



Previous robust classification work (Shivaswamy et al., 2006; Bhattacharyya et al., 2004a,b; Bhattacharyya, 2004; Trafalis and Gilbert, 2007) considers the classification problem where the inputs are subject to (unknown) disturbances $\vec{\delta} = (\delta_1, \ldots, \delta_m)$ and essentially solves the following min-max problem:

$$
\min_{w,b}\ \max_{\vec{\delta} \in \mathcal{N}_{\mathrm{box}}} \Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \Big\}, \tag{1}
$$

for a box-type uncertainty set $\mathcal{N}_{\mathrm{box}}$. That is, let $\mathcal{N}_i$ denote the projection of $\mathcal{N}_{\mathrm{box}}$ onto the $\delta_i$ component; then $\mathcal{N}_{\mathrm{box}} = \mathcal{N}_1 \times \cdots \times \mathcal{N}_m$. Effectively, this allows simultaneous worst-case disturbances across many samples, and leads to overly conservative solutions. The goal of this paper is to obtain a robust formulation where the disturbances $\{\delta_i\}$ may be meaningfully taken to be correlated, i.e., to solve for a non-box-type $\mathcal{N}$:

$$
\min_{w,b}\ \max_{\vec{\delta} \in \mathcal{N}} \Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \Big\}. \tag{2}
$$


We briefly explain here the four reasons that motivate this "robust to perturbation" setup and in particular the min-max form of (1) and (2). First, it can explicitly incorporate prior problem knowledge of local invariance (e.g., Teo et al., 2008). For example, in vision tasks, a desirable classifier should provide a consistent answer if an input image slightly changes. Second, there are situations where some adversarial opponents (e.g., spam senders) will manipulate the testing samples to avoid being correctly classified, and the robustness toward such manipulation should be taken into consideration in the training process (e.g., Globerson and Roweis, 2006). Third, the training samples and the testing samples can be obtained from different processes, and hence the standard i.i.d. assumption is violated (e.g., Bi and Zhang, 2004). For example, in real-time applications, the newly generated samples are often less accurate due to time constraints. Finally, formulations based on chance constraints (e.g., Bhattacharyya et al., 2004b; Shivaswamy et al., 2006) are mathematically equivalent to such a min-max formulation.

We define explicitly the correlated disturbance (or uncertainty) which we study below. 
Definition 1 A set $\mathcal{N}_0 \subseteq \mathbb{R}^n$ is called an Atomic Uncertainty Set if

(I) $0 \in \mathcal{N}_0$;

(II) For any $w_0 \in \mathbb{R}^n$: $\sup_{\delta \in \mathcal{N}_0} [w_0^\top \delta] = \sup_{\delta' \in \mathcal{N}_0} [-w_0^\top \delta'] < +\infty$.

We use "sup" here because the maximal value is not necessarily attained, since $\mathcal{N}_0$ may not be a closed set. The second condition of Atomic Uncertainty Set basically says that the uncertainty set is bounded and symmetric. In particular, all norm balls and ellipsoids centered at the origin are atomic uncertainty sets, while an arbitrary polytope might not be an atomic uncertainty set.
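Condition (II) can be checked numerically through the support function $w \mapsto \sup_{\delta} w^\top \delta$. The sketch below, written under assumptions of our own (the $\ell_2$ ball and the polytope $[0,1]^n$ are illustrative choices, and the function names are hypothetical), compares the supremum in the directions $w$ and $-w$ for both sets.

```python
import numpy as np

def support_l2_ball(w, c):
    """sup of <w, d> over the l2 ball {d : ||d||_2 <= c}; equals c * ||w||_2."""
    return c * np.linalg.norm(w)

def support_unit_box(w):
    """sup of <w, d> over the polytope [0, 1]^n, attained at a vertex."""
    return np.maximum(w, 0.0).sum()

w = np.array([2.0, -1.0])

# The ball satisfies (II): the supremum is the same in directions w and -w.
ball_pos, ball_neg = support_l2_ball(w, 1.0), support_l2_ball(-w, 1.0)

# The box [0, 1]^2 contains the origin, so (I) holds, but (II) fails here.
box_pos, box_neg = support_unit_box(w), support_unit_box(-w)
```

The box is bounded and contains the origin, yet its support function is not symmetric, which is one concrete way a polytope can fail to be an atomic uncertainty set.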

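Definition 2 Let $\mathcal{N}_0$ be an atomic uncertainty set. A set $\mathcal{N} \subseteq \mathbb{R}^{n \times m}$ is called a Sublinear Aggregated Uncertainty Set of $\mathcal{N}_0$, if

$$
\mathcal{N}^- \subseteq \mathcal{N} \subseteq \mathcal{N}^+,
$$

where:

$$
\begin{aligned}
\mathcal{N}^- &\triangleq \bigcup_{t=1}^m \mathcal{N}_t^-; \qquad \mathcal{N}_t^- \triangleq \{(\delta_1, \ldots, \delta_m) \mid \delta_t \in \mathcal{N}_0;\ \delta_{i} = 0,\ \forall i \ne t\}; \\
\mathcal{N}^+ &\triangleq \Big\{(\alpha_1 \delta_1, \ldots, \alpha_m \delta_m) \,\Big|\, \sum_{i=1}^m \alpha_i = 1;\ \alpha_i \ge 0,\ \delta_i \in \mathcal{N}_0,\ i = 1, \ldots, m\Big\}.
\end{aligned}
$$

The Sublinear Aggregated Uncertainty definition models the case where the disturbances on each sample are treated identically, but their aggregate behavior across multiple samples is controlled. Some interesting examples include

(1) $\{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\| \le c\}$;

(2) $\{(\delta_1, \ldots, \delta_m) \mid \exists t \in [1:m];\ \|\delta_t\| \le c;\ \delta_i = 0,\ \forall i \ne t\}$;

(3) $\{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \sqrt{c\,\|\delta_i\|} \le c\}$.

All these examples have the same atomic uncertainty set $\mathcal{N}_0 = \{\delta \mid \|\delta\| \le c\}$. Figure 1 provides an illustration of a sublinear aggregated uncertainty set for $n = 1$ and $m = 2$, i.e., the training set consists of two univariate samples.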
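The sandwich $\mathcal{N}^- \subseteq \mathcal{N} \subseteq \mathcal{N}^+$ can be explored numerically for the case $n = 1$, $m = 2$. The sketch below is an illustration under our own assumptions: it uses the $\ell_2$ norm, hypothetical function names, and the observation (which can be checked from Definition 2 for a norm ball $\mathcal{N}_0$) that $\mathcal{N}^+$ then coincides with example (1) and $\mathcal{N}^-$ with example (2).

```python
import numpy as np

def in_N_minus(deltas, c):
    """At most one sample is disturbed, and by at most c (example (2))."""
    norms = np.linalg.norm(deltas, axis=1)
    return bool(np.count_nonzero(norms) <= 1 and norms.max() <= c)

def in_N_plus(deltas, c):
    """The total disturbance is at most c (example (1))."""
    return bool(np.linalg.norm(deltas, axis=1).sum() <= c)

def in_example_3(deltas, c):
    """Membership test for example (3): sum_i sqrt(c * ||delta_i||) <= c."""
    norms = np.linalg.norm(deltas, axis=1)
    return bool(np.sqrt(c * norms).sum() <= c)

c = 1.0
# Rows are the disturbances delta_1, delta_2 of two univariate samples.
d1 = np.array([[0.8], [0.0]])    # one sample disturbed
d2 = np.array([[0.25], [0.25]])  # both disturbed, small total
d3 = np.array([[0.5], [0.5]])    # both disturbed, total exactly c
d4 = np.array([[0.8], [0.8]])    # allowed by a box set, total exceeds c

memberships = {
    name: (in_N_minus(d, c), in_example_3(d, c), in_N_plus(d, c))
    for name, d in [("d1", d1), ("d2", d2), ("d3", d3), ("d4", d4)]
}
```

The point `d4` lies in the box $\mathcal{N}_0 \times \mathcal{N}_0$ but in none of the three aggregated sets, which is the kind of simultaneous worst-case disturbance the aggregated definition rules out.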

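Theorem 3 Assume $\{x_i, y_i\}_{i=1}^m$ are non-separable, $r(\cdot): \mathbb{R}^{n+1} \to \mathbb{R}$ is an arbitrary function, $\mathcal{N}$ is a Sublinear Aggregated Uncertainty set with corresponding atomic uncertainty set $\mathcal{N}_0$. Then the following min-max problem

$$
\min_{w,b}\ \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}} \Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \Big\} \tag{3}
$$

is equivalent to the following optimization problem on $w, b, \xi$:

$$
\begin{aligned}
\min:\quad & r(w,b) + \sup_{\delta \in \mathcal{N}_0} (w^\top \delta) + \sum_{i=1}^m \xi_i, \\
\text{s.t.}:\quad & \xi_i \ge 1 - \big[y_i(\langle w, x_i\rangle + b)\big], \quad i = 1, \ldots, m; \\
& \xi_i \ge 0, \quad i = 1, \ldots, m.
\end{aligned} \tag{4}
$$

Furthermore, the minimization of Problem (4) is attainable when $r(\cdot, \cdot)$ is lower semi-continuous.

Figure 1: Illustration of a Sublinear Aggregated Uncertainty Set $\mathcal{N}$. Panels: (a) $\mathcal{N}^-$; (b) $\mathcal{N}^+$; (c) $\mathcal{N}$; (d) box uncertainty.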

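Proof Define:

$$
v(w,b) \triangleq \sup_{\delta \in \mathcal{N}_0} (w^\top \delta) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big].
$$

Recall that $\mathcal{N}^- \subseteq \mathcal{N} \subseteq \mathcal{N}^+$ by definition. Hence, fixing any $(w, b) \in \mathbb{R}^{n+1}$, the following inequalities hold:

$$
\begin{aligned}
& \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}^-} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \\
\le\ & \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \\
\le\ & \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}^+} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big].
\end{aligned}
$$

To prove the theorem, we first show that $v(w, b)$ is no larger than the leftmost expression and then show $v(w, b)$ is no smaller than the rightmost expression.

Step 1: We prove that

$$
v(w,b) \le \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}^-} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]. \tag{5}
$$

Since the samples $\{x_i, y_i\}_{i=1}^m$ are not separable, there exists $t \in [1:m]$ such that

$$
y_t(\langle w, x_t\rangle + b) < 0. \tag{6}
$$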
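Hence,

$$
\begin{aligned}
& \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}_t^-} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \\
=\ & \sum_{i \ne t} \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] + \sup_{\delta_t \in \mathcal{N}_0} \max\big[1 - y_t(\langle w, x_t - \delta_t\rangle + b),\, 0\big] \\
=\ & \sum_{i \ne t} \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] + \max\Big[1 - y_t(\langle w, x_t\rangle + b) + \sup_{\delta_t \in \mathcal{N}_0} (y_t w^\top \delta_t),\, 0\Big] \\
=\ & \sum_{i \ne t} \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] + \max\big[1 - y_t(\langle w, x_t\rangle + b),\, 0\big] + \sup_{\delta_t \in \mathcal{N}_0} (y_t w^\top \delta_t) \\
=\ & \sup_{\delta \in \mathcal{N}_0} (w^\top \delta) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] = v(w,b).
\end{aligned}
$$

The third equality holds because of Inequality (6) and $\sup_{\delta_t \in \mathcal{N}_0} (y_t w^\top \delta_t)$ being non-negative (recall $0 \in \mathcal{N}_0$). Since $\mathcal{N}_t^- \subseteq \mathcal{N}^-$, Inequality (5) follows.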
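Step 2: Next we prove that

$$
\sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}^+} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \le v(w,b). \tag{7}
$$

Notice that by the definition of $\mathcal{N}^+$ we have

$$
\begin{aligned}
& \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}^+} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \\
=\ & \sup_{\sum_{i=1}^m \alpha_i = 1;\ \alpha_i \ge 0;\ \hat{\delta}_i \in \mathcal{N}_0} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \alpha_i \hat{\delta}_i\rangle + b),\, 0\big] \qquad (8) \\
=\ & \sup_{\sum_{i=1}^m \alpha_i = 1;\ \alpha_i \ge 0} \sum_{i=1}^m \max\Big[\sup_{\hat{\delta}_i \in \mathcal{N}_0} \big(1 - y_i(\langle w, x_i - \alpha_i \hat{\delta}_i\rangle + b)\big),\, 0\Big].
\end{aligned}
$$

Now, for any $i \in [1:m]$, the following holds,

$$
\begin{aligned}
& \max\Big[\sup_{\hat{\delta}_i \in \mathcal{N}_0} \big(1 - y_i(\langle w, x_i - \alpha_i \hat{\delta}_i\rangle + b)\big),\, 0\Big] \\
=\ & \max\Big[1 - y_i(\langle w, x_i\rangle + b) + \alpha_i \sup_{\hat{\delta}_i \in \mathcal{N}_0} (w^\top \hat{\delta}_i),\, 0\Big] \\
\le\ & \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] + \alpha_i \sup_{\hat{\delta}_i \in \mathcal{N}_0} (w^\top \hat{\delta}_i).
\end{aligned}
$$

Therefore, Equation (8) is upper bounded by

$$
\begin{aligned}
& \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] + \sup_{\sum_{i=1}^m \alpha_i = 1;\ \alpha_i \ge 0} \sum_{i=1}^m \alpha_i \sup_{\hat{\delta}_i \in \mathcal{N}_0} (w^\top \hat{\delta}_i) \\
=\ & \sup_{\delta \in \mathcal{N}_0} (w^\top \delta) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] = v(w,b),
\end{aligned}
$$

hence Inequality (7) holds.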

Step 3: Combining the two steps and adding $r(w, b)$ on both sides leads to: $\forall (w, b) \in \mathbb{R}^{n+1}$,

$$
\sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] + r(w,b) = v(w,b) + r(w,b).
$$

Taking the infimum on both sides establishes the equivalence of Problem (3) and Problem (4). Observe that $\sup_{\delta \in \mathcal{N}_0} w^\top \delta$ is a supremum over a class of affine functions, and hence is lower semi-continuous. Therefore $v(\cdot, \cdot)$ is also lower semi-continuous. Thus the minimum can be achieved for Problem (4), and Problem (3) by equivalence, when $r(\cdot)$ is lower semi-continuous. ■

This theorem reveals the main difference between Formulation (1) and our formulation in (2). Consider a Sublinear Aggregated Uncertainty set $\mathcal{N} = \{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\| \le c\}$. The smallest box-type uncertainty set containing $\mathcal{N}$ includes disturbances with norm sum up to $mc$. Therefore, it leads to a regularization coefficient as large as $mc$ that is linked to the number of training samples, and will therefore be overly conservative.
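The gap between the two uncertainty models can be seen on a small example. The sketch below evaluates, for one fixed $(w, b)$, the worst-case hinge loss under the box set (each $\|\delta_i\|_2 \le c$ separately) and the closed form of Theorem 3 with $r \equiv 0$ for the aggregated set. The data, the choice of the $\ell_2$ norm, and the function names are assumptions made for illustration; the toy set is non-separable and the chosen classifier misclassifies the last sample.

```python
import numpy as np

def margins(w, b, X, y):
    """y_i(<w, x_i> + b) for every sample."""
    return y * (X @ w + b)

def worst_case_box(w, b, X, y, c):
    """Each delta_i ranges over its own l2 ball of radius c."""
    shifted = 1.0 - margins(w, b, X, y) + c * np.linalg.norm(w)
    return np.maximum(shifted, 0.0).sum()

def worst_case_aggregated(w, b, X, y, c):
    """Closed form of Theorem 3 with r = 0: hinge sum plus c * ||w||_2."""
    hinge_sum = np.maximum(1.0 - margins(w, b, X, y), 0.0).sum()
    return hinge_sum + c * np.linalg.norm(w)

# Hypothetical non-separable toy data in R^2.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, c = np.array([1.0, 0.0]), 0.0, 0.5

box_value = worst_case_box(w, b, X, y, c)
aggregated_value = worst_case_aggregated(w, b, X, y, c)
box_upper_bound = np.maximum(1.0 - margins(w, b, X, y), 0.0).sum() + len(y) * c * np.linalg.norm(w)
```

On this data the aggregated model adds a single penalty $c\|w\|_2$, while the box model adds a penalty for every sample whose shifted margin is violated, bounded above by $mc\|w\|_2$.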

An immediate corollary is that a special case of our robust formulation is equivalent to 
the norm-regularized SVM setup: 

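Corollary 4 Let $\mathcal{T} \triangleq \big\{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\|^* \le c\big\}$. If the training samples $\{x_i, y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(w, b)$ are equivalent:

$$
\min:\quad \max_{(\delta_1, \ldots, \delta_m) \in \mathcal{T}} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big], \tag{9}
$$

$$
\min:\quad c\|w\| + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big]. \tag{10}
$$

Proof Let $\mathcal{N}_0$ be the dual-norm ball $\{\delta \mid \|\delta\|^* \le c\}$ and $r(w, b) \equiv 0$. Then $\sup_{\|\delta\|^* \le c} (w^\top \delta) = c\|w\|$. The corollary follows from Theorem 3. Notice indeed the equivalence holds for any $w$ and $b$. ■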
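The two sides of Corollary 4 can be compared numerically for a fixed $(w, b)$. The sketch below is an illustration under our own assumptions: it uses the $\ell_2$ norm (which is its own dual), hypothetical function names, and a small non-separable toy set. Following the two steps of the proof of Theorem 3, it computes a lower bound on the worst case by spending the whole budget $c$ on a single sample, compares it with the regularized value in (10), and then checks that randomly drawn feasible disturbances never exceed that value.

```python
import numpy as np

def hinge(w, b, X, y):
    """Per-sample hinge loss."""
    return np.maximum(1.0 - y * (X @ w + b), 0.0)

def regularized_value(w, b, X, y, c):
    """Objective of (10) for the l2 norm."""
    return c * np.linalg.norm(w) + hinge(w, b, X, y).sum()

def single_sample_attack(w, b, X, y, c):
    """Worst case over disturbances that put the whole budget on one sample."""
    base = hinge(w, b, X, y)
    attacked = np.maximum(1.0 - y * (X @ w + b) + c * np.linalg.norm(w), 0.0)
    return (base.sum() - base + attacked).max()

def random_feasible_value(w, b, X, y, c, rng):
    """Hinge sum under a random disturbance with sum_i ||delta_i||_2 <= c."""
    m, n = X.shape
    directions = rng.normal(size=(m, n))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    budget = c * rng.dirichlet(np.ones(m))   # non-negative shares of c
    deltas = directions * budget[:, None]
    return hinge(w, b, X - deltas, y).sum()

# Hypothetical non-separable toy data in R^2.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, c = np.array([1.0, 0.0]), 0.0, 0.5

rng = np.random.default_rng(0)
reg_value = regularized_value(w, b, X, y, c)
attack_value = single_sample_attack(w, b, X, y, c)
sampled = [random_feasible_value(w, b, X, y, c, rng) for _ in range(1000)]
```

Random disturbances rarely come close to the worst case; the single-sample attack attains it because the chosen classifier already misclassifies one sample, as in Inequality (6).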

This corollary explains the widely known fact that the regularized classifier tends to be more robust. Specifically, it explains the observation that when the disturbance is noise-like and neutral rather than adversarial, a norm-regularized classifier (without any robustness requirement) often performs better than a box-type robust classifier (see Trafalis and Gilbert, 2007). On the other hand, this observation also suggests that the appropriate way to regularize should come from a disturbance-robustness perspective. The above equivalence implies that standard regularization essentially assumes that the disturbance is spherical; if this is not true, robustness may yield a better regularization-like algorithm. To find a more effective regularization term, a closer investigation of the data variation is desirable, e.g., by examining the variation of the data and solving the corresponding robust classification problem. For example, one way to regularize is by splitting the given training samples into two subsets with an equal number of elements, and treating one as a disturbed copy of the other. By analyzing the direction of the disturbance and the magnitude of the total variation, one can choose the proper norm to use, and a suitable tradeoff parameter.

2. The optimization equivalence for the linear case was observed independently by Bertsimas and Fertis (2008).



3. Probabilistic Interpretations 

Although Problem (3) is formulated without any probabilistic assumptions, in this section, we briefly explain two approaches to construct the uncertainty set and equivalently tune the regularization parameter $c$ based on probabilistic information.

The first approach is to use Problem (3) to approximate an upper bound for a chance-constrained classifier. Suppose the disturbance $(\delta_1^r, \ldots, \delta_m^r)$ follows a joint probability measure $\mu$. Then the chance-constrained classifier is given by the following minimization problem given a confidence level $\eta \in [0, 1]$,

$$
\begin{aligned}
\min_{w,b,l}:\quad & l \\
\text{s.t.}:\quad & \mu\Big\{ \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i^r\rangle + b),\, 0\big] \le l \Big\} \ge 1 - \eta.
\end{aligned} \tag{11}
$$

i=l 



The formulations in Shivaswamy et al. (2006), Lanckriet et al. (2002) and Bhattacharyya et al. (2004a) assume uncorrelated noise and require all constraints to be satisfied with high probability simultaneously. They find a vector $[\xi_1, \ldots, \xi_m]^\top$ where each $\xi_i$ is the $\eta$-quantile of the hinge loss for sample $x_i^r$. In contrast, our formulation above minimizes the $\eta$-quantile of the average (or equivalently the sum of) empirical error. When controlling this average quantity is of more interest, the box-type noise formulation will be overly conservative.

Problem (11) is generally intractable. However, we can approximate it as follows. Let

$$
c^* \triangleq \inf\Big\{\alpha \,\Big|\, \mu\Big(\sum_{i} \|\delta_i\|^* \le \alpha\Big) \ge 1 - \eta\Big\}.
$$

Notice that $c^*$ is easily simulated given $\mu$. Then for any $(w, b)$, with probability no less than $1 - \eta$, the following holds,

$$
\sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i^r\rangle + b),\, 0\big] \le \max_{\sum_i \|\delta_i\|^* \le c^*} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big].
$$

Thus (11) is upper bounded by (10) with $c = c^*$. This gives an additional probabilistic robustness property of the standard regularized classifier. Notice that following a similar approach but with the constraint-wise robust setup, i.e., the box uncertainty set, would lead to considerably more pessimistic approximations of the chance constraint.
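The remark that $c^*$ is easily simulated can be made concrete with a Monte Carlo estimate. The sketch below assumes, purely for illustration, that the disturbances are i.i.d. Gaussian and that the norm is $\ell_2$ (self-dual); the function names, sample sizes, and noise scale are hypothetical. It estimates $c^*$ as the empirical $(1 - \eta)$-quantile of the total disturbance $\sum_i \|\delta_i\|_2$.

```python
import numpy as np

def estimate_c_star(sample_disturbances, eta, n_draws, rng):
    """Empirical (1 - eta)-quantile of sum_i ||delta_i||_2 under mu."""
    totals = np.empty(n_draws)
    for k in range(n_draws):
        deltas = sample_disturbances(rng)                 # shape (m, n)
        totals[k] = np.linalg.norm(deltas, axis=1).sum()
    return float(np.quantile(totals, 1.0 - eta)), totals

def gaussian_disturbances(rng, m=20, n=5, sigma=0.1):
    """Assumed noise model: independent Gaussian disturbance per sample."""
    return sigma * rng.normal(size=(m, n))

rng = np.random.default_rng(0)
c_star, totals = estimate_c_star(gaussian_disturbances, eta=0.05, n_draws=2000, rng=rng)

# Fraction of simulated draws whose total disturbance stays within c_star.
coverage = float(np.mean(totals <= c_star))
```

The estimate can then be used as the coefficient $c$ in (10); any sampler for $\mu$ can be substituted for the Gaussian model assumed here.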

The second approach considers a Bayesian setup. Suppose the total disturbance $c^r \triangleq \sum_{i=1}^m \|\delta_i^r\|^*$ follows a prior distribution $\rho(\cdot)$. This can model for example the case that the training sample set is a mixture of several data sets where the disturbance magnitude of each set is known. Such a setup leads to the following classifier which minimizes the Bayesian (robust) error:

$$
\min_{w,b}:\quad \int \Big\{ \max_{\sum_i \|\delta_i\|^* \le c} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \Big\}\, d\rho(c). \tag{12}
$$

By Corollary 4, the Bayesian classifier (12) is equivalent to

$$
\min_{w,b}:\quad \int \Big\{ c\|w\| + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big] \Big\}\, d\rho(c),
$$

which can be further simplified as

$$
\min_{w,b}:\quad \bar{c}\,\|w\| + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big],
$$

where $\bar{c} \triangleq \int c\, d\rho(c)$. This thus provides us a justifiable parameter tuning method different from cross validation: simply using the expected value of $c^r$. We note that it is the equivalence of Corollary 4 that makes this possible, since it is difficult to imagine a setting where one would have a prior on regularization coefficients.
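The simplification to $\bar{c}$ relies only on the objective being affine in $c$. A short numerical check, using a hypothetical discrete prior over three disturbance levels, the $\ell_2$ norm, and toy data of our own choosing, is sketched below.

```python
import numpy as np

def svm_objective(w, b, X, y, c):
    """c * ||w||_2 plus the total hinge loss, as in (10)."""
    return c * np.linalg.norm(w) + np.maximum(1.0 - y * (X @ w + b), 0.0).sum()

# Hypothetical discrete prior on the total disturbance c.
c_values = np.array([0.1, 0.5, 1.0])
c_probs = np.array([0.5, 0.3, 0.2])
c_bar = float(c_values @ c_probs)

# Hypothetical toy data and classifier.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, -0.5]), 0.2

# Expectation of the objective over the prior versus the objective at c_bar.
expected_objective = sum(p * svm_objective(w, b, X, y, c) for c, p in zip(c_values, c_probs))
objective_at_mean = svm_objective(w, b, X, y, c_bar)
```

Because the two quantities agree for every $(w, b)$, minimizing the Bayesian objective is the same as solving the regularized problem with coefficient $\bar{c}$.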

4. Kernelization 

The previous results can be easily generalized to the kernelized setting, which we discuss in 
detail in this section. In particular, similar to the linear classification case, we give a new 
interpretation of the standard kernelized SVM as the min-max empirical hinge-loss solution, 
where the disturbance is assumed to lie in the feature space. We then relate this to the 
(more intuitively appealing) setup where the disturbance lies in the sample space. We use 
this relationship in Section 5 to prove a consistency result for kernelized SVMs.

The kernelized SVM formulation considers a linear classifier in the feature space $\mathcal{H}$, a Hilbert space containing the range of some feature mapping $\Phi(\cdot)$. The standard formulation is as follows,

$$
\begin{aligned}
\min_{w,b}:\quad & r(w,b) + \sum_{i=1}^m \xi_i \\
\text{s.t.}:\quad & \xi_i \ge 1 - y_i(\langle w, \Phi(x_i)\rangle + b), \\
& \xi_i \ge 0.
\end{aligned}
$$

It has been proved in Schölkopf and Smola (2002) that if we take $f(\langle w, w\rangle)$ for some increasing function $f(\cdot)$ as the regularization term $r(w, b)$, then the optimal solution has a representation $w^* = \sum_{i=1}^m \alpha_i \Phi(x_i)$, which can further be solved without knowing explicitly the feature mapping, but by evaluating a kernel function $k(x, x') \triangleq \langle \Phi(x), \Phi(x')\rangle$ only. This is the well-known "kernel trick".
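To illustrate the representation $w^* = \sum_i \alpha_i \Phi(x_i)$, the sketch below compares quantities computed through the kernel alone with the same quantities computed through an explicit feature map. The homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$ is chosen here only because its feature map is finite-dimensional and easy to write down; the kernel, the coefficients $\alpha$, and all names are assumptions for illustration.

```python
import numpy as np

def poly2_kernel(A, B):
    """k(x, x') = (<x, x'>)^2 for the rows of A and B."""
    return (A @ B.T) ** 2

def poly2_features(A):
    """Explicit feature map of poly2_kernel on R^2."""
    x1, x2 = A[:, 0], A[:, 1]
    return np.stack([x1**2, np.sqrt(2.0) * x1 * x2, x2**2], axis=1)

# Hypothetical training points, expansion coefficients, and a new point.
X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
alpha = np.array([0.5, -1.0, 0.25])
x_new = np.array([[2.0, -1.0]])

# Kernel-only evaluation: <w, w> = alpha^T K alpha, <w, Phi(x)> = sum_i alpha_i k(x_i, x).
K = poly2_kernel(X, X)
norm_sq_kernel = float(alpha @ K @ alpha)
score_kernel = float(alpha @ poly2_kernel(X, x_new)[:, 0])

# Same quantities with w formed explicitly in the feature space.
w = alpha @ poly2_features(X)
norm_sq_explicit = float(w @ w)
score_explicit = float(w @ poly2_features(x_new)[0])
```

Both the regularizer $f(\langle w, w\rangle)$ and the decision values therefore depend on the data only through kernel evaluations.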

The definitions of Atomic Uncertainty Set and Sublinear Aggregated Uncertainty Set in the feature space are identical to Definitions 1 and 2, with $\mathbb{R}^n$ replaced by $\mathcal{H}$. The following theorem is a feature-space counterpart of Theorem 3. The proof follows from a similar argument to Theorem 3, i.e., for any fixed $(w, b)$ the worst-case empirical error equals the empirical error plus a penalty term $\sup_{\delta \in \mathcal{N}_0} (\langle w, \delta\rangle)$, and hence the details are omitted.
Theorem 5 Assume $\{\Phi(x_i), y_i\}_{i=1}^m$ are not linearly separable, $r(\cdot): \mathcal{H} \times \mathbb{R} \to \mathbb{R}$ is an arbitrary function, $\mathcal{N} \subseteq \mathcal{H}^m$ is a Sublinear Aggregated Uncertainty set with corresponding atomic uncertainty set $\mathcal{N}_0 \subseteq \mathcal{H}$. Then the following min-max problem

$$
\min_{w,b}\ \sup_{(\delta_1, \ldots, \delta_m) \in \mathcal{N}} \Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i) - \delta_i\rangle + b),\, 0\big] \Big\} \tag{13}
$$

is equivalent to

$$
\begin{aligned}
\min:\quad & r(w,b) + \sup_{\delta \in \mathcal{N}_0} (\langle w, \delta\rangle) + \sum_{i=1}^m \xi_i, \\
\text{s.t.}:\quad & \xi_i \ge 1 - y_i(\langle w, \Phi(x_i)\rangle + b), \quad i = 1, \ldots, m; \\
& \xi_i \ge 0, \quad i = 1, \ldots, m.
\end{aligned} \tag{14}
$$

Furthermore, the minimization of Problem (14) is attainable when $r(\cdot, \cdot)$ is lower semi-continuous.

For some widely used feature mappings (e.g., RKHS of a Gaussian kernel), $\{\Phi(x_i), y_i\}_{i=1}^m$ are always separable. In this case, the worst-case empirical error may not be equal to the empirical error plus a penalty term $\sup_{\delta \in \mathcal{N}_0} (\langle w, \delta\rangle)$. However, it is easy to show that for any $(w, b)$, the latter is an upper bound of the former.

The next corollary is the feature-space counterpart of Corollary 4, where $\|\cdot\|_{\mathcal{H}}$ stands for the RKHS norm, i.e., for $z \in \mathcal{H}$, $\|z\|_{\mathcal{H}} = \sqrt{\langle z, z\rangle}$. Noticing that the RKHS norm is self-dual, we find that the proof is identical to that of Corollary 4, and hence omit it.

Corollary 6 Let $\mathcal{T}_{\mathcal{H}} \triangleq \big\{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\|_{\mathcal{H}} \le c\big\}$. If $\{\Phi(x_i), y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(w, b)$ are equivalent:

$$
\min:\quad \max_{(\delta_1, \ldots, \delta_m) \in \mathcal{T}_{\mathcal{H}}} \sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i) - \delta_i\rangle + b),\, 0\big], \tag{15}
$$

$$
\min:\quad c\|w\|_{\mathcal{H}} + \sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i)\rangle + b),\, 0\big]. \tag{16}
$$

Equation (16) is a variant form of the standard SVM that has a squared RKHS norm regularization term, and it can be shown that the two formulations are equivalent up to a change of the tradeoff parameter $c$, since both the empirical hinge loss and the RKHS norm are convex. Therefore, Corollary 6 essentially means that the standard kernelized SVM is implicitly a robust classifier (without regularization) with disturbance in the feature space, where the sum of the magnitudes of the disturbances is bounded.

Disturbance in the feature-space is less intuitive than disturbance in the sample space, 
and the next lemma relates these two different notions. 

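Lemma 7 Suppose there exists $\mathcal{X} \subseteq \mathbb{R}^n$, $\rho > 0$, and a continuous non-decreasing function
$f : \mathbb{R}^+ \to \mathbb{R}^+$ satisfying $f(0) = 0$, such that

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}') \le f(\|\mathbf{x}-\mathbf{x}'\|_2^2), \quad \forall \mathbf{x},\mathbf{x}' \in \mathcal{X},\ \|\mathbf{x}-\mathbf{x}'\|_2 \le \rho,$$

then

$$\|\Phi(\hat{\mathbf{x}} + \delta) - \Phi(\hat{\mathbf{x}})\|_{\mathcal{H}} \le \sqrt{f(\|\delta\|_2^2)}, \quad \forall \|\delta\|_2 \le \rho,\ \hat{\mathbf{x}},\ \hat{\mathbf{x}} + \delta \in \mathcal{X}.$$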

In the appendix, we prove a result that provides a tighter relationship between disturbance 
in the feature space and disturbance in the sample space, for RBF kernels. 
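Proof Expanding the RKHS norm yields

$$\begin{aligned}
\|\Phi(\hat{\mathbf{x}} + \delta) - \Phi(\hat{\mathbf{x}})\|_{\mathcal{H}}
&= \sqrt{\langle \Phi(\hat{\mathbf{x}} + \delta) - \Phi(\hat{\mathbf{x}}),\ \Phi(\hat{\mathbf{x}} + \delta) - \Phi(\hat{\mathbf{x}})\rangle} \\
&= \sqrt{\langle \Phi(\hat{\mathbf{x}} + \delta), \Phi(\hat{\mathbf{x}} + \delta)\rangle + \langle \Phi(\hat{\mathbf{x}}), \Phi(\hat{\mathbf{x}})\rangle - 2\langle \Phi(\hat{\mathbf{x}} + \delta), \Phi(\hat{\mathbf{x}})\rangle} \\
&= \sqrt{k(\hat{\mathbf{x}} + \delta, \hat{\mathbf{x}} + \delta) + k(\hat{\mathbf{x}}, \hat{\mathbf{x}}) - 2k(\hat{\mathbf{x}} + \delta, \hat{\mathbf{x}})} \\
&\le \sqrt{f(\|\hat{\mathbf{x}} + \delta - \hat{\mathbf{x}}\|_2^2)} = \sqrt{f(\|\delta\|_2^2)},
\end{aligned}$$

where the inequality follows from the assumption. ■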

Lemma 7 essentially says that under certain conditions, robustness in the feature space is
a stronger requirement than robustness in the sample space. Therefore, a classifier that
achieves robustness in the feature space (the SVM for example) also achieves robustness in
the sample space. Notice that the condition of Lemma 7 is rather weak. In particular, it
holds for any continuous $k(\cdot,\cdot)$ and bounded $\mathcal{X}$.
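
As an illustration of Lemma 7, the sketch below takes the Gaussian RBF kernel, for which $k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}') = 2 - 2\exp(-\|\mathbf{x}-\mathbf{x}'\|_2^2/(2\sigma^2))$, so one valid choice is $f(t) = 2 - 2\exp(-t/(2\sigma^2))$. The bandwidth `SIGMA` and the function names are assumptions made for this example only; the feature-space distance is computed through the kernel, without forming $\Phi$ explicitly.

```python
import numpy as np

SIGMA = 1.0  # kernel bandwidth (illustrative choice)

def k(x, xp):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((x - xp) ** 2) / (2.0 * SIGMA ** 2)))

def f(t):
    """Candidate f for Lemma 7, as a function of t = ||x - x'||_2^2."""
    return 2.0 - 2.0 * np.exp(-t / (2.0 * SIGMA ** 2))

def feature_distance(x, xp):
    """||Phi(x) - Phi(x')||_H computed through the kernel trick."""
    return float(np.sqrt(max(k(x, x) + k(xp, xp) - 2.0 * k(x, xp), 0.0)))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
delta = 0.1 * rng.normal(size=3)

lhs = feature_distance(x + delta, x)          # disturbance in the feature space
rhs = float(np.sqrt(f(np.sum(delta ** 2))))   # bound from Lemma 7
print(lhs, rhs)
```

For this kernel the bound holds with equality up to rounding, and it shrinks to zero with $\|\delta\|_2$, which is the "locality" property used later in the consistency proof.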

In the next section we consider a more foundational property of robustness in the sample
space: we show that a classifier that is robust in the sample space is asymptotically
consistent. As a consequence of this result for linear classifiers, the above results imply the
consistency for a broad class of kernelized SVMs.



5. Consistency of Regularization 

In this section we explore a fundamental connection between learning and robustness, by
using robustness properties to re-prove the statistical consistency of the linear classifier,
and then the kernelized SVM. Indeed, our proof mirrors the consistency proof found in
(Steinwart, 2005), with the key difference that we replace metric entropy, VC-dimension,
and stability conditions used there, with a robustness condition.

Thus far we have considered the setup where the training-samples are corrupted by
certain set-inclusive disturbances. We now turn to the standard statistical learning setup,
by assuming that all training samples and testing samples are generated i.i.d. according to
an (unknown) probability $\mathbb{P}$, i.e., there does not exist explicit disturbance.

Let $\mathcal{X} \subseteq \mathbb{R}^n$ be bounded, and suppose the training samples $(\mathbf{x}_i, y_i)_{i=1}^{\infty}$ are generated i.i.d.
according to an unknown distribution $\mathbb{P}$ supported by $\mathcal{X} \times \{-1, +1\}$. The next theorem
shows that our robust classifier setup, and equivalently the regularized SVM, asymptotically
minimizes an upper-bound of the expected classification error and hinge loss.

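Theorem 8 Denote $K = \max_{\mathbf{x}\in\mathcal{X}} \|\mathbf{x}\|_2$. Then there exists a random sequence $\{\gamma_{m,c}\}$ such
that:

1. $\forall c > 0$, $\lim_{m\to\infty} \gamma_{m,c} = 0$ almost surely, and the convergence is uniform in $\mathbb{P}$;

2. the following bounds on the Bayes loss and the hinge loss hold uniformly for all $(\mathbf{w}, b)$:

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\big(\mathbf{1}_{y \neq \operatorname{sgn}(\langle \mathbf{w}, \mathbf{x}\rangle + b)}\big) \le \gamma_{m,c} + c\|\mathbf{w}\|_2 + \frac{1}{m}\sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b),\ 0\big];$$

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\big(\max(1 - y(\langle \mathbf{w}, \mathbf{x}\rangle + b),\ 0)\big) \le \gamma_{m,c}\big(1 + K\|\mathbf{w}\|_2 + |b|\big) + c\|\mathbf{w}\|_2 + \frac{1}{m}\sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b),\ 0\big].$$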

Proof We briefly explain the basic idea of the proof before going to the technical details.
We consider the testing sample set as a perturbed copy of the training sample set,
and measure the magnitude of the perturbation. For testing samples that have "small"
perturbations, $c\|\mathbf{w}\|_2 + \frac{1}{m}\sum_{i=1}^{m} \max[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b), 0]$ upper-bounds their total loss
by the sample-space counterpart of Corollary 6. Therefore, we only need to show that the ratio of
testing samples having "large" perturbations diminishes to prove the theorem.

Now we present the detailed proof. Given a $c > 0$, we call a testing sample $(\mathbf{x}', y')$ and
a training sample $(\mathbf{x}, y)$ a sample pair if $y = y'$ and $\|\mathbf{x} - \mathbf{x}'\|_2 \le c$. We say a set of training
samples and a set of testing samples form $l$ pairings if there exist $l$ sample pairs with no
data reused. Given m training samples and m testing samples, we use $M_{m,c}$ to denote
the largest number of pairings. To prove this theorem, we need to establish the following
lemma.
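
The quantity $M_{m,c}$ is the size of a maximum matching in the bipartite graph whose edges are the admissible sample pairs. The following minimal sketch computes it with a standard augmenting-path matching; this is an illustration of the definition rather than part of the proof, and the function name `max_pairings` and the toy data are assumptions.

```python
import numpy as np

def max_pairings(X_train, y_train, X_test, y_test, c):
    """Largest number of disjoint (train, test) pairs with equal labels and
    Euclidean distance at most c, via augmenting-path bipartite matching."""
    m_tr, m_te = len(X_train), len(X_test)
    dist = np.linalg.norm(X_train[:, None, :] - X_test[None, :, :], axis=2)
    ok = (dist <= c) & (y_train[:, None] == y_test[None, :])
    match_of_test = [-1] * m_te   # training index currently paired with each test sample

    def try_assign(i, seen):
        # Try to pair training sample i, re-routing earlier pairs if needed.
        for j in range(m_te):
            if ok[i, j] and not seen[j]:
                seen[j] = True
                if match_of_test[j] == -1 or try_assign(match_of_test[j], seen):
                    match_of_test[j] = i
                    return True
        return False

    return sum(try_assign(i, [False] * m_te) for i in range(m_tr))

# Toy example: the second training sample can only use the first test sample,
# so the first training sample must be re-routed to the second test sample.
X_tr = np.array([[0.0, 0.0], [1.0, 0.0]])
y_tr = np.array([1, 1])
X_te = np.array([[0.5, 0.0], [-0.9, 0.0]])
y_te = np.array([1, 1])
print(max_pairings(X_tr, y_tr, X_te, y_te, c=1.0))
```

The recursion is adequate for small samples; for large m an iterative matching routine would be preferable.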

Lemma 9 Given a $c > 0$, $M_{m,c}/m \to 1$ almost surely as $m \to +\infty$, uniformly w.r.t. $\mathbb{P}$.

Proof We make a partition of $\mathcal{X} \times \{-1, +1\} = \bigcup_{t=1}^{T_c} \mathcal{X}_t$ such that $\mathcal{X}_t$ either has the form
$[\alpha_1, \alpha_1 + c/\sqrt{n}) \times [\alpha_2, \alpha_2 + c/\sqrt{n}) \times \cdots \times [\alpha_n, \alpha_n + c/\sqrt{n}) \times \{+1\}$ or
$[\alpha_1, \alpha_1 + c/\sqrt{n}) \times [\alpha_2, \alpha_2 + c/\sqrt{n}) \times \cdots \times [\alpha_n, \alpha_n + c/\sqrt{n}) \times \{-1\}$ (recall n is the dimension of $\mathcal{X}$). That is, each partition
is the Cartesian product of a rectangular cell in $\mathcal{X}$ and a singleton in $\{-1, +1\}$. Notice that
if a training sample and a testing sample fall into $\mathcal{X}_t$, they can form a pairing.

Let $N_t^{tr}$ and $N_t^{te}$ be the number of training samples and testing samples falling in
the $t$-th set, respectively. Thus, $(N_1^{tr}, \dots, N_{T_c}^{tr})$ and $(N_1^{te}, \dots, N_{T_c}^{te})$ are multinomially
distributed random vectors following a same distribution. Notice that for a multinomially
distributed random vector $(N_1, \dots, N_k)$ with parameter m and $(p_1, \dots, p_k)$, the following holds
(Bretagnolle-Huber-Carol inequality, see for example Proposition A6.6 of van der Vaart and Wellner,
2000). For any $\lambda > 0$,

$$\mathbb{P}\Big(\sum_{i=1}^{k} |N_i - m p_i| \ge 2\sqrt{m}\,\lambda\Big) \le 2^k \exp(-2\lambda^2).$$

Hence we have

$$\begin{aligned}
& \mathbb{P}\Big(\sum_{t=1}^{T_c} \big|N_t^{tr} - N_t^{te}\big| \ge 4\sqrt{m}\,\lambda\Big) \le 2^{T_c+1} \exp(-2\lambda^2), \\
\Longrightarrow\quad & \mathbb{P}\Big(\frac{1}{m}\sum_{t=1}^{T_c} \big|N_t^{tr} - N_t^{te}\big| \ge \lambda\Big) \le 2^{T_c+1} \exp\Big(\frac{-m\lambda^2}{8}\Big), \\
\Longrightarrow\quad & \mathbb{P}\big(M_{m,c}/m \le 1 - \lambda\big) \le 2^{T_c+1} \exp\Big(\frac{-m\lambda^2}{8}\Big).
\end{aligned} \tag{17}$$

Observe that $\sum_{m=1}^{\infty} 2^{T_c+1} \exp\big(\frac{-m\lambda^2}{8}\big) < +\infty$, hence by the Borel-Cantelli Lemma (see for
example Durrett, 2004), with probability one the event $\{M_{m,c}/m \le 1 - \lambda\}$ only occurs
finitely often as $m \to \infty$. That is, $\liminf_m M_{m,c}/m \ge 1 - \lambda$ almost surely. Since $\lambda$ can
be arbitrarily close to zero, $M_{m,c}/m \to 1$ almost surely. Observe that this convergence is
uniform in $\mathbb{P}$, since $T_c$ only depends on $\mathcal{X}$. ■
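
The partition argument can be mirrored in a short simulation. Pairing samples that share a brick and a label gives the lower bound $M_{m,c} \ge \sum_t \min(N_t^{tr}, N_t^{te}) = m - \frac{1}{2}\sum_t |N_t^{tr} - N_t^{te}|$, which is the step behind the last implication in (17). The sketch below is a hedged illustration: the uniform data, the grid anchored at the origin, and the name `brick_pair_lower_bound` are assumptions of this example.

```python
import numpy as np

def brick_pair_lower_bound(X_train, y_train, X_test, y_test, c):
    """Pair samples sharing a brick of side c/sqrt(n) and a label.
    Returns (number of such pairs, sum_t |N_t^tr - N_t^te|)."""
    n = X_train.shape[1]
    side = c / np.sqrt(n)   # the brick diagonal is then at most c

    def counts(X, y):
        cells = np.floor(X / side).astype(int)
        out = {}
        for cell, label in zip(map(tuple, cells), y):
            key = (cell, int(label))
            out[key] = out.get(key, 0) + 1
        return out

    n_tr, n_te = counts(X_train, y_train), counts(X_test, y_test)
    keys = set(n_tr) | set(n_te)
    paired = sum(min(n_tr.get(t, 0), n_te.get(t, 0)) for t in keys)
    l1_gap = sum(abs(n_tr.get(t, 0) - n_te.get(t, 0)) for t in keys)
    return paired, l1_gap

rng = np.random.default_rng(0)
m, c = 2000, 0.5
X_tr, X_te = rng.random((m, 2)), rng.random((m, 2))
y_tr, y_te = rng.choice([-1, 1], m), rng.choice([-1, 1], m)

paired, l1_gap = brick_pair_lower_bound(X_tr, y_tr, X_te, y_te, c)
print(paired / m)   # lower bound on M_{m,c} / m
```

Rerunning with larger m should move the printed ratio towards 1, in line with Lemma 9.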

Now we proceed to prove the theorem. Given m training samples and m testing samples
with $M_{m,c}$ sample pairs, we notice that for these paired samples, both the total testing error
and the total testing hinge-loss are upper bounded by

$$\max_{(\delta_1,\dots,\delta_m)\in\mathcal{N}_0\times\cdots\times\mathcal{N}_0} \sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i - \delta_i\rangle + b),\ 0\big] \le cm\|\mathbf{w}\|_2 + \sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b),\ 0\big],$$

where $\mathcal{N}_0 = \{\delta \mid \|\delta\| \le c\}$. Hence the total classification error of the m testing samples can
be upper bounded by

$$(m - M_{m,c}) + cm\|\mathbf{w}\|_2 + \sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b),\ 0\big],$$

and since

$$\max_{\mathbf{x}\in\mathcal{X}} \big(1 - y(\langle \mathbf{w}, \mathbf{x}\rangle + b)\big) \le \max_{\mathbf{x}\in\mathcal{X}} \Big\{1 + |b| + \sqrt{\langle \mathbf{x}, \mathbf{x}\rangle \cdot \langle \mathbf{w}, \mathbf{w}\rangle}\Big\} = 1 + |b| + K\|\mathbf{w}\|_2,$$

the accumulated hinge-loss of the total m testing samples is upper bounded by

$$(m - M_{m,c})\big(1 + K\|\mathbf{w}\|_2 + |b|\big) + cm\|\mathbf{w}\|_2 + \sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b),\ 0\big].$$

Therefore, the average testing error is upper bounded by

$$1 - M_{m,c}/m + c\|\mathbf{w}\|_2 + \frac{1}{m}\sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b),\ 0\big], \tag{18}$$

and the average hinge loss is upper bounded by

$$(1 - M_{m,c}/m)\big(1 + K\|\mathbf{w}\|_2 + |b|\big) + c\|\mathbf{w}\|_2 + \frac{1}{m}\sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \mathbf{x}_i\rangle + b),\ 0\big].$$

Let $\gamma_{m,c} = 1 - M_{m,c}/m$. The proof follows since $M_{m,c}/m \to 1$ almost surely for any $c > 0$.
Notice by Inequality (17) we have

$$\mathbb{P}(\gamma_{m,c} \ge \lambda) \le \exp\big(-m\lambda^2/8 + (T_c + 1)\log 2\big), \tag{19}$$

i.e., the convergence is uniform in $\mathbb{P}$.

We have shown that the average testing error is upper bounded. The final step is to
show that this implies that in fact the random variable given by the conditional expectation
(conditioned on the training sample) of the error is bounded almost surely as in the
statement of the theorem. To make things precise, consider a fixed m, and let $\omega_1 \in \Omega_1$
and $\omega_2 \in \Omega_2$ generate the m training samples and m testing samples, respectively, and for
shorthand let $T^m$ denote the random variable of the first m training samples. Let us denote
the probability measures for the training by $\rho_1$ and the testing samples by $\rho_2$. By independence,
the joint measure is given by the product of these two. We rely on this property in
what follows. Now fix a $\lambda$ and a $c > 0$. In our new notation, Equation (19) now reads:

$$\int_{\Omega_1}\int_{\Omega_2} \mathbf{1}\{\gamma_{m,c}(\omega_1,\omega_2) \ge \lambda\}\, d\rho_2(\omega_2)\, d\rho_1(\omega_1) = \mathbb{P}\big(\gamma_{m,c}(\omega_1,\omega_2) \ge \lambda\big) \le \exp\big(-m\lambda^2/8 + (T_c + 1)\log 2\big).$$

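We now bound $\mathbb{P}_{\omega_1}\big(\mathbb{E}_{\omega_2}[\gamma_{m,c}(\omega_1,\omega_2) \mid T^m] > 2\lambda\big)$, and then use Borel-Cantelli to show that
this event can happen only finitely often. Writing $\gamma$ for $\gamma_{m,c}(\omega_1,\omega_2)$ inside the integrals, we have:

$$\begin{aligned}
& \mathbb{P}_{\omega_1}\big(\mathbb{E}_{\omega_2}[\gamma_{m,c}(\omega_1,\omega_2) \mid T^m] > 2\lambda\big) \\
&= \int_{\Omega_1} \mathbf{1}\Big\{\int_{\Omega_2} \gamma\, d\rho_2(\omega_2) > 2\lambda\Big\}\, d\rho_1(\omega_1) \\
&\le \int_{\Omega_1} \mathbf{1}\Big\{\Big[\int_{\Omega_2} \gamma\,\mathbf{1}(\gamma \le \lambda)\, d\rho_2(\omega_2) + \int_{\Omega_2} \gamma\,\mathbf{1}(\gamma > \lambda)\, d\rho_2(\omega_2)\Big] \ge 2\lambda\Big\}\, d\rho_1(\omega_1) \\
&\le \int_{\Omega_1} \mathbf{1}\Big\{\Big[\int_{\Omega_2} \lambda\,\mathbf{1}(\gamma \le \lambda)\, d\rho_2(\omega_2) + \int_{\Omega_2} \mathbf{1}(\gamma > \lambda)\, d\rho_2(\omega_2)\Big] \ge 2\lambda\Big\}\, d\rho_1(\omega_1) \\
&\le \int_{\Omega_1} \mathbf{1}\Big\{\Big[\lambda + \int_{\Omega_2} \mathbf{1}(\gamma > \lambda)\, d\rho_2(\omega_2)\Big] \ge 2\lambda\Big\}\, d\rho_1(\omega_1) \\
&= \int_{\Omega_1} \mathbf{1}\Big\{\int_{\Omega_2} \mathbf{1}(\gamma > \lambda)\, d\rho_2(\omega_2) \ge \lambda\Big\}\, d\rho_1(\omega_1).
\end{aligned}$$

Here, the first equality holds because training and testing samples are independent, and
hence the joint measure is the product of $\rho_1$ and $\rho_2$. The second inequality holds because
$\gamma_{m,c}(\omega_1,\omega_2) \le 1$ everywhere. Further notice that

$$\int_{\Omega_1}\int_{\Omega_2} \mathbf{1}\{\gamma \ge \lambda\}\, d\rho_2(\omega_2)\, d\rho_1(\omega_1) \ge \int_{\Omega_1} \lambda\, \mathbf{1}\Big\{\int_{\Omega_2} \mathbf{1}(\gamma > \lambda)\, d\rho_2(\omega_2) \ge \lambda\Big\}\, d\rho_1(\omega_1).$$

Thus we have

$$\mathbb{P}\big(\mathbb{E}_{\omega_2}(\gamma_{m,c}(\omega_1,\omega_2)) > 2\lambda\big) \le \mathbb{P}\big(\gamma_{m,c}(\omega_1,\omega_2) \ge \lambda\big)/\lambda \le \exp\big(-m\lambda^2/8 + (T_c + 1)\log 2\big)/\lambda.$$

For any $\lambda$ and c, summing up the right hand side over m = 1 to $\infty$ is finite, hence the
theorem follows from the Borel-Cantelli lemma. ■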
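
To get a feel for Inequality (19) used in the proof above, the following small sketch evaluates its right-hand side and inverts it for the sample size. Solving $\exp(-m\lambda^2/8 + (T_c+1)\log 2) \le \eta$ for m gives $m \ge 8\big((T_c+1)\log 2 + \log(1/\eta)\big)/\lambda^2$. The function names and the numerical values of $\lambda$, $T_c$ and $\eta$ are illustrative assumptions.

```python
import math

def gamma_tail_bound(m, lam, T_c):
    """Right-hand side of Inequality (19): exp(-m*lam^2/8 + (T_c + 1)*log 2)."""
    return math.exp(-m * lam ** 2 / 8.0 + (T_c + 1) * math.log(2.0))

def samples_needed(lam, T_c, eta):
    """Smallest m for which the bound in (19) is at most eta."""
    return math.ceil(8.0 * ((T_c + 1) * math.log(2.0) + math.log(1.0 / eta)) / lam ** 2)

lam, T_c, eta = 0.1, 16, 0.01
m_star = samples_needed(lam, T_c, eta)
print(m_star, gamma_tail_bound(m_star, lam, T_c))
```

The required m grows linearly in the number of bricks $T_c$ and quadratically in $1/\lambda$, which is why c may only be driven to zero slowly in the consistency argument.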



Remark 10 We notice that $M_m/m$ converges to 1 almost surely even when $\mathcal{X}$ is not
bounded. Indeed, to see this, fix $\epsilon > 0$, and let $\mathcal{X}' \subseteq \mathcal{X}$ be a bounded set such that
$\mathbb{P}(\mathcal{X}') > 1 - \epsilon$. Then, with probability one,

$$\#(\text{unpaired samples in } \mathcal{X}')/m \to 0,$$

by Lemma 9. In addition,

$$\max\big(\#(\text{training samples not in } \mathcal{X}'),\ \#(\text{testing samples not in } \mathcal{X}')\big)/m \to \epsilon.$$

Notice that

$$M_m \ge m - \#(\text{unpaired samples in } \mathcal{X}') - \max\big(\#(\text{training samples not in } \mathcal{X}'),\ \#(\text{testing samples not in } \mathcal{X}')\big).$$

Hence

$$\lim_{m\to\infty} M_m/m \ge 1 - \epsilon,$$

almost surely. Since $\epsilon$ is arbitrary, we have $M_m/m \to 1$ almost surely.

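Next, we prove an analog of Theorem 8 for the kernelized case, and then show that
these two imply statistical consistency of linear and kernelized SVMs. Again, let $\mathcal{X} \subseteq \mathbb{R}^n$
be bounded, and suppose the training samples $(\mathbf{x}_i, y_i)_{i=1}^{\infty}$ are generated i.i.d. according to
an unknown distribution $\mathbb{P}$ supported on $\mathcal{X} \times \{-1, +1\}$.

Theorem 11 Denote $K = \max_{\mathbf{x}\in\mathcal{X}} k(\mathbf{x}, \mathbf{x})$. Suppose there exists $\rho > 0$ and a continuous
non-decreasing function $f : \mathbb{R}^+ \to \mathbb{R}^+$ satisfying $f(0) = 0$, such that:

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}') \le f(\|\mathbf{x} - \mathbf{x}'\|_2^2), \quad \forall \mathbf{x},\mathbf{x}' \in \mathcal{X},\ \|\mathbf{x}-\mathbf{x}'\|_2 \le \rho.$$

Then there exists a random sequence $\{\gamma_{m,c}\}$ such that,

1. $\forall c > 0$, $\lim_{m\to\infty} \gamma_{m,c} = 0$ almost surely, and the convergence is uniform in $\mathbb{P}$;

2. the following bounds on the Bayes loss and the hinge loss hold uniformly for all $(\mathbf{w}, b) \in \mathcal{H} \times \mathbb{R}$:

$$\mathbb{E}_{\mathbb{P}}\big(\mathbf{1}_{y \neq \operatorname{sgn}(\langle \mathbf{w}, \Phi(\mathbf{x})\rangle + b)}\big) \le \gamma_{m,c} + c\|\mathbf{w}\|_{\mathcal{H}} + \frac{1}{m}\sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \Phi(\mathbf{x}_i)\rangle + b),\ 0\big],$$

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\big(\max(1 - y(\langle \mathbf{w}, \Phi(\mathbf{x})\rangle + b),\ 0)\big) \le \gamma_{m,c}\big(1 + K\|\mathbf{w}\|_{\mathcal{H}} + |b|\big) + c\|\mathbf{w}\|_{\mathcal{H}} + \frac{1}{m}\sum_{i=1}^{m} \max\big[1 - y_i(\langle \mathbf{w}, \Phi(\mathbf{x}_i)\rangle + b),\ 0\big].$$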






Proof As in the proof of Theorem 8, we generate a set of m testing samples and m training
samples, and then lower-bound the number of samples that can form a sample pair in the
feature-space; that is, a pair consisting of a training sample $(\mathbf{x}, y)$ and a testing sample
$(\mathbf{x}', y')$ such that $y = y'$ and $\|\Phi(\mathbf{x}) - \Phi(\mathbf{x}')\|_{\mathcal{H}} \le c$. In contrast to the finite-dimensional
sample space, the feature space may be infinite dimensional, and thus our decomposition
may have an infinite number of "bricks." In this case, our multinomial random variable
argument used in the proof of Lemma 9 breaks down. Nevertheless, we are able to lower
bound the number of sample pairs in the feature space by the number of sample pairs in
the sample space.

Define $f^{-1}(\alpha) = \max\{\beta \ge 0 \mid f(\beta) \le \alpha\}$. Since $f(\cdot)$ is continuous, $f^{-1}(\alpha) > 0$ for any
$\alpha > 0$. Now notice that by Lemma 7, if a testing sample $\mathbf{x}$ and a training sample $\mathbf{x}'$ belong
to a "brick" with length of each side $\min(\rho/\sqrt{n},\ f^{-1}(c^2)/\sqrt{n})$ in the sample space (see the
proof of Lemma 9), $\|\Phi(\mathbf{x}) - \Phi(\mathbf{x}')\|_{\mathcal{H}} \le c$. Hence the number of sample pairs in the feature
space is lower bounded by the number of pairs of samples that fall in the same brick in
the sample space. We can cover $\mathcal{X}$ with finitely many (denoted as $T_c$) such bricks since
$f^{-1}(c^2) > 0$. Then, a similar argument as in Lemma 9 shows that the ratio of samples
that form pairs in a brick converges to 1 as m increases. Further notice that for M paired
samples, the total testing error and hinge-loss are both upper-bounded by

$$cM\|\mathbf{w}\|_{\mathcal{H}} + \sum_{i=1}^{M} \max\big[1 - y_i(\langle \mathbf{w}, \Phi(\mathbf{x}_i)\rangle + b),\ 0\big].$$

The rest of the proof is identical to Theorem 8. In particular, Inequality (19) still holds. ■

Notice that the condition in Theorem 11 is satisfied by most widely used kernels, e.g.,
homogeneous polynomial kernels, and Gaussian RBF. This condition requires that the
feature mapping is "smooth" and hence preserves "locality" of the disturbance, i.e., small
disturbance in the sample space guarantees the corresponding disturbance in the feature
space is also small. It is easy to construct non-smooth kernel functions which do not
generalize well. For example, consider the following kernel:

$$k(\mathbf{x}, \mathbf{x}') = \begin{cases} 1 & \mathbf{x} = \mathbf{x}'; \\ 0 & \mathbf{x} \neq \mathbf{x}'. \end{cases}$$

A standard RKHS regularized SVM using this kernel leads to a decision function

$$\operatorname{sign}\Big(\sum_{i=1}^{m} \alpha_i k(\mathbf{x}, \mathbf{x}_i) + b\Big),$$

which equals $\operatorname{sign}(b)$ and provides no meaningful prediction if the testing sample $\mathbf{x}$ is not
one of the training samples. Hence as m increases, the testing error remains as large as 50%
regardless of the tradeoff parameter used in the algorithm, while the training error can be
made arbitrarily small by fine-tuning the parameter.
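
The failure of this non-smooth kernel is easy to reproduce. In the minimal sketch below the coefficients `alpha` and the offset `b` are arbitrary illustrative values rather than the output of an SVM solver, and the function names are assumptions; the point is only that every point outside the training set receives the label $\operatorname{sign}(b)$.

```python
import numpy as np

def delta_kernel(x, xp):
    """Non-smooth kernel: 1 if the two points coincide, 0 otherwise."""
    return 1.0 if np.array_equal(x, xp) else 0.0

def decision(x, X_train, alpha, b):
    """sign(sum_i alpha_i k(x, x_i) + b) for the kernel above."""
    score = sum(a * delta_kernel(x, xi) for a, xi in zip(alpha, X_train)) + b
    return float(np.sign(score))

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.array([2.0, -3.0, -3.0])   # arbitrary illustrative coefficients
b = 0.5

train_labels = [decision(x, X_train, alpha, b) for x in X_train]
unseen_label = decision(np.array([0.5, 0.5]), X_train, alpha, b)
print(train_labels, unseen_label)
```

The training points can be fitted with any labels by adjusting `alpha`, while the prediction on unseen points never changes.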



Convergence to Bayes Risk 



Next we relate the results of Theorem 8 and Theorem 11 to the standard consistency
notion, i.e., convergence to the Bayes Risk (Steinwart, 2005). The key point of interest
in our proof is the use of a robustness condition in place of a VC-dimension or stability
condition used in (Steinwart, 2005). The proof in (Steinwart, 2005) has 4 main steps.
They show: (i) there always exists a minimizer to the expected regularized (kernel) hinge
loss; (ii) the expected regularized hinge loss of the minimizer converges to the expected
hinge loss as the regularizer goes to zero; (iii) if a sequence of functions asymptotically
have optimal expected hinge loss, then they also have optimal expected loss; and (iv) the
expected hinge loss of the minimizer of the regularized training hinge loss concentrates
around the empirical regularized hinge loss. In (Steinwart, 2005), this final step, (iv), is
accomplished using concentration inequalities derived from VC-dimension considerations,
and stability considerations.

Instead, we use our robustness-based results of Theorem 8 and Theorem 11 to replace
these approaches (Lemmas 3.21 and 3.22 in (Steinwart, 2005)) in proving step (iv), and
thus to establish the main result.

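Recall that a classifier is a rule that assigns to every training set $T = \{\mathbf{x}_i, y_i\}_{i=1}^m$ a
measurable function $f_T$. The risk of a measurable function $f : \mathcal{X} \to \mathbb{R}$ is defined as

$$\mathcal{R}_{\mathbb{P}}(f) \triangleq \mathbb{P}(\{\mathbf{x}, y : \operatorname{sign} f(\mathbf{x}) \neq y\}).$$

The smallest achievable risk

$$\mathcal{R}_{\mathbb{P}} \triangleq \inf\{\mathcal{R}_{\mathbb{P}}(f) \mid f \text{ measurable}\}$$

is called the Bayes Risk of $\mathbb{P}$. A classifier is said to be strongly uniformly consistent if for
all distributions $\mathbb{P}$ on $\mathcal{X} \times \{-1, +1\}$, the following holds almost surely.

$$\lim_{m\to\infty} \mathcal{R}_{\mathbb{P}}(f_T) = \mathcal{R}_{\mathbb{P}}.$$

Without loss of generality, we only consider the kernel version. Recall a definition from
Steinwart (2005).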



Definition 12 Let $C(\mathcal{X})$ be the set of all continuous functions defined on $\mathcal{X}$. Consider the
mapping $I : \mathcal{H} \to C(\mathcal{X})$ defined by $I\mathbf{w} = \langle \mathbf{w}, \Phi(\cdot)\rangle$. If $I$ has a dense image, we call the
kernel universal.

Roughly speaking, if a kernel is universal, it is rich enough to satisfy the condition of step
(ii) above.

Theorem 13 If a kernel satisfies the condition of Theorem 11, and is universal, then the
Kernel SVM with $c \downarrow 0$ sufficiently slowly is strongly uniformly consistent.



Proof We first introduce some notation, largely following (Steinwart, 2005). For some
probability measure $\mu$ and $(\mathbf{w}, b) \in \mathcal{H} \times \mathbb{R}$,

$$R_{L,\mu}((\mathbf{w}, b)) \triangleq \mathbb{E}_{(\mathbf{x},y)\sim\mu}\big\{\max(0,\ 1 - y(\langle \mathbf{w}, \Phi(\mathbf{x})\rangle + b))\big\}$$

is the expected hinge-loss under probability $\mu$, and

$$R^c_{L,\mu}((\mathbf{w}, b)) \triangleq c\|\mathbf{w}\|_{\mathcal{H}} + \mathbb{E}_{(\mathbf{x},y)\sim\mu}\big\{\max(0,\ 1 - y(\langle \mathbf{w}, \Phi(\mathbf{x})\rangle + b))\big\}$$

is the regularized expected hinge-loss. Hence $R_{L,\mathbb{P}}(\cdot)$ and $R^c_{L,\mathbb{P}}(\cdot)$ are the expected hinge-loss
and regularized expected hinge-loss under the generating probability $\mathbb{P}$. If $\mu$ is the
empirical distribution of m samples, we write $R_{L,m}(\cdot)$ and $R^c_{L,m}(\cdot)$ respectively. Notice
$R^c_{L,m}(\cdot)$ is the objective function of the SVM. Denote its solution by $f_{m,c}$, i.e., the classifier
we get by running SVM with m samples and parameter c. Further denote by $f_{\mathbb{P},c} \in \mathcal{H} \times \mathbb{R}$
the minimizer of $R^c_{L,\mathbb{P}}(\cdot)$. The existence of such a minimizer is proved in Lemma 3.1 of
Steinwart (2005) (step (i)). Let

$$\mathcal{R}_{L,\mathbb{P}} \triangleq \min_{f \text{ measurable}} \mathbb{E}_{\mathbf{x},y\sim\mathbb{P}}\big\{\max(1 - y f(\mathbf{x}),\ 0)\big\},$$

i.e., the smallest achievable hinge-loss for all measurable functions.

The main content of our proof is to use Theorems 8 and 11 to prove step (iv) in Steinwart
(2005). In particular, we show: if $c \downarrow 0$ "slowly", we have with probability one

$$\lim_{m\to\infty} R_{L,\mathbb{P}}(f_{m,c}) = \mathcal{R}_{L,\mathbb{P}}. \tag{20}$$

To prove Equation (20), denote by $\mathbf{w}(f)$ and $b(f)$ the weight part and offset part of any
classifier $f$. Next, we bound the magnitude of $f_{m,c}$ by using $R^c_{L,m}(f_{m,c}) \le R^c_{L,m}(\mathbf{0}, 0) \le 1$,
which leads to

$$\|\mathbf{w}(f_{m,c})\|_{\mathcal{H}} \le 1/c$$

and

$$|b(f_{m,c})| \le 2 + K\|\mathbf{w}(f_{m,c})\|_{\mathcal{H}} \le 2 + K/c.$$

From Theorem 11 (note that the bound holds uniformly for all $(\mathbf{w}, b)$), we have

$$\begin{aligned}
R_{L,\mathbb{P}}(f_{m,c}) &\le \gamma_{m,c}\big[1 + K\|\mathbf{w}(f_{m,c})\|_{\mathcal{H}} + |b|\big] + R^c_{L,m}(f_{m,c}) \\
&\le \gamma_{m,c}[3 + 2K/c] + R^c_{L,m}(f_{m,c}) \\
&\le \gamma_{m,c}[3 + 2K/c] + R^c_{L,m}(f_{\mathbb{P},c}) \\
&= \mathcal{R}_{L,\mathbb{P}} + \gamma_{m,c}[3 + 2K/c] + \big\{R^c_{L,m}(f_{\mathbb{P},c}) - R^c_{L,\mathbb{P}}(f_{\mathbb{P},c})\big\} + \big\{R^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}\big\} \\
&= \mathcal{R}_{L,\mathbb{P}} + \gamma_{m,c}[3 + 2K/c] + \big\{R_{L,m}(f_{\mathbb{P},c}) - R_{L,\mathbb{P}}(f_{\mathbb{P},c})\big\} + \big\{R^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}\big\}.
\end{aligned}$$

The last inequality holds because $f_{m,c}$ minimizes $R^c_{L,m}$.

It is known (Steinwart, 2005, Proposition 3.2) (step (ii)) that if the kernel used is rich
enough, i.e., universal, then

$$\lim_{c\to 0} R^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) = \mathcal{R}_{L,\mathbb{P}}.$$

For fixed $c > 0$, we have

$$\lim_{m\to\infty} R_{L,m}(f_{\mathbb{P},c}) = R_{L,\mathbb{P}}(f_{\mathbb{P},c}),$$

almost surely due to the strong law of large numbers (notice that $f_{\mathbb{P},c}$ is a fixed classifier),
and $\gamma_{m,c}[3 + 2K/c] \to 0$ almost surely. Notice that neither convergence rate depends on $\mathbb{P}$.
Therefore, if $c \downarrow 0$ sufficiently slowly,$^3$ we have almost surely

$$\lim_{m\to\infty} R_{L,\mathbb{P}}(f_{m,c}) \le \mathcal{R}_{L,\mathbb{P}}.$$

3. For example, we can take $\{c(m)\}$ be the smallest number satisfying $c(m) \ge m^{-1/8}$ and $T_{c(m)} \le
m^{1/8}/\log 2 - 1$. Inequality (19) thus leads to $\sum_{m=1}^{\infty} \mathbb{P}\big(\gamma_{m,c(m)}/c(m) \ge m^{-1/4}\big) < +\infty$ which implies
uniform convergence of $\gamma_{m,c(m)}/c(m)$.

Now, for any m and c, we have $R_{L,\mathbb{P}}(f_{m,c}) \ge \mathcal{R}_{L,\mathbb{P}}$ by definition. This implies that
Equation (20) holds almost surely, thus giving us step (iv).

Finally, Proposition 3.3 of (Steinwart, 2005) shows step (iii), namely, approximating
hinge loss is sufficient to guarantee approximation of the Bayes loss. Thus Equation (20)
implies that the risk of function $f_{m,c}$ converges to Bayes risk. ■
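
The norm bound $\|\mathbf{w}(f_{m,c})\|_{\mathcal{H}} \le 1/c$ used in the proof follows from comparing with the zero classifier, whose regularized empirical hinge loss equals 1. The sketch below illustrates this for a linear kernel. The subgradient solver is a crude stand-in chosen only for self-containedness (it is not the solver assumed in the paper), and the synthetic data and all names are assumptions. Because the best iterate is never worse than the starting point $(\mathbf{0}, 0)$, it satisfies $c\|\mathbf{w}\| \le 1$.

```python
import numpy as np

def reg_hinge(w, b, X, y, c):
    """Empirical regularized hinge loss: c*||w||_2 + mean_i max(0, 1 - y_i(<w, x_i> + b))."""
    return c * np.linalg.norm(w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b)))

def subgradient_descent(X, y, c, steps=500, lr=0.1):
    """Crude subgradient descent from (0, 0); returns the best iterate seen."""
    w, b = np.zeros(X.shape[1]), 0.0
    best = (reg_hinge(w, b, X, y, c), w.copy(), b)
    for t in range(1, steps + 1):
        active = (1.0 - y * (X @ w + b)) > 0          # samples with positive hinge loss
        norm = np.linalg.norm(w)
        g_w = -(y[active][:, None] * X[active]).sum(axis=0) / len(y)
        if norm > 0:
            g_w = g_w + c * w / norm                  # subgradient of c*||w||_2
        g_b = -y[active].sum() / len(y)
        step = lr / np.sqrt(t)
        w, b = w - step * g_w, b - step * g_b
        val = reg_hinge(w, b, X, y, c)
        if val < best[0]:
            best = (val, w.copy(), b)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1.0, -1.0)
c = 0.05

val, w, b = subgradient_descent(X, y, c)
print(val, np.linalg.norm(w), 1.0 / c)
```

Note that the regularizer here is the unsquared norm, matching $R^c_{L,m}$ in the proof rather than the squared-norm form of the standard SVM.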



6. Concluding Remarks 

This work considers the relationship between robust and regularized SVM classification. In 
particular, we prove that the standard norm-regularized SVM classifier is in fact the solution 
to a robust classification setup, and thus known results about regularized classifiers extend 
to robust classifiers. To the best of our knowledge, this is the first explicit such link between 
regularization and robustness in pattern classification. This link suggests that norm-based 
regularization essentially builds in a robustness to sample noise whose probability level sets 
are symmetric, and moreover have the structure of the unit ball with respect to the dual of 
the regularizing norm. It would be interesting to understand the performance gains possible 
when the noise does not have such characteristics, and the robust setup is used in place of 
regularization with appropriately defined uncertainty set. 

Based on the robustness interpretation of the regularization term, we re-proved the 
consistency of SVMs without direct appeal to notions of metric entropy, VC-dimension, or 
stability. Our proof suggests that the ability to handle disturbance is crucial for an algorithm 
to achieve good generalization ability. In particular, for "smooth" feature mappings, the 
robustness to disturbance in the observation space is guaranteed and hence SVMs achieve 
consistency. On the other hand, certain "non-smooth" feature mappings fail to be consistent
simply because for such kernels the robustness in the feature-space (guaranteed by the
regularization process) does not imply robustness in the observation space.

Appendix A.

In this appendix we show that for RBF kernels, it is possible to relate robustness in the 
feature space and robustness in the sample space more directly. 

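Theorem 14 Suppose the Kernel function has the form $k(\mathbf{x}, \mathbf{x}') = f(\|\mathbf{x} - \mathbf{x}'\|)$, with $f :
\mathbb{R}^+ \to \mathbb{R}$ a decreasing function. Denote by $\mathcal{H}$ the RKHS space of $k(\cdot,\cdot)$ and $\Phi(\cdot)$ the
corresponding feature mapping. Then we have for any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{w} \in \mathcal{H}$ and $c > 0$,

$$\sup_{\|\delta\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \delta)\rangle = \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle.$$

Proof We show that the left-hand-side is not larger than the right-hand-side, and vice
versa.

First we show

$$\sup_{\|\delta\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \delta)\rangle \le \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle. \tag{21}$$

We notice that for any $\|\delta\| \le c$, we have

$$\begin{aligned}
\langle \mathbf{w}, \Phi(\mathbf{x} - \delta)\rangle
&= \big\langle \mathbf{w},\ \Phi(\mathbf{x}) + (\Phi(\mathbf{x} - \delta) - \Phi(\mathbf{x}))\big\rangle \\
&= \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \langle \mathbf{w}, \Phi(\mathbf{x} - \delta) - \Phi(\mathbf{x})\rangle \\
&\le \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \|\mathbf{w}\|_{\mathcal{H}} \cdot \|\Phi(\mathbf{x} - \delta) - \Phi(\mathbf{x})\|_{\mathcal{H}} \\
&\le \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \|\mathbf{w}\|_{\mathcal{H}} \sqrt{2f(0) - 2f(c)} \\
&= \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle.
\end{aligned}$$

Taking the supremum over $\delta$ establishes Inequality (21).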
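Next, we show the opposite inequality,

$$\sup_{\|\delta\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \delta)\rangle \ge \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle. \tag{22}$$

If $f(c) = f(0)$, then Inequality (22) holds trivially, hence we only consider the case that
$f(c) < f(0)$. Notice that the inner product is a continuous function in $\mathcal{H}$, hence for any
$\epsilon > 0$, there exists a $\delta'_\phi$ such that

$$\langle \mathbf{w}, \Phi(\mathbf{x}) - \delta'_\phi\rangle > \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon; \qquad \|\delta'_\phi\|_{\mathcal{H}} < \sqrt{2f(0) - 2f(c)}.$$

Recall that the RKHS space is the completion of the feature mapping, thus there exists a
sequence of $\{\mathbf{x}'_i\} \in \mathbb{R}^n$ such that

$$\Phi(\mathbf{x}'_i) \to \Phi(\mathbf{x}) - \delta'_\phi, \tag{23}$$

which is equivalent to

$$\big(\Phi(\mathbf{x}'_i) - \Phi(\mathbf{x})\big) \to -\delta'_\phi.$$

This leads to

$$\begin{aligned}
\lim_{i\to\infty} \sqrt{2f(0) - 2f(\|\mathbf{x}'_i - \mathbf{x}\|)}
&= \lim_{i\to\infty} \|\Phi(\mathbf{x}'_i) - \Phi(\mathbf{x})\|_{\mathcal{H}} \\
&= \|\delta'_\phi\|_{\mathcal{H}} < \sqrt{2f(0) - 2f(c)}.
\end{aligned}$$

Since $f$ is decreasing, we conclude that $\|\mathbf{x}'_i - \mathbf{x}\| \le c$ holds except for a finite number of $i$.
By (23) we have

$$\langle \mathbf{w}, \Phi(\mathbf{x}'_i)\rangle \to \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta'_\phi\rangle > \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon,$$

which means

$$\sup_{\|\delta\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \delta)\rangle \ge \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon.$$

Since $\epsilon$ is arbitrary, we establish Inequality (22).

Combining Inequality (21) and Inequality (22) proves the theorem.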